Abstract

Background: Alignment of biological sequences such as DNA, RNA or proteins is one of the most widely used
tools in computational bioscience. All existing alignment algorithms rely on heuristic scoring schemes based on
biological expertise. Therefore, these algorithms do not provide model-independent and objective measures for
how similar two (or more) sequences actually are. Although information theory provides such a similarity measure
- the mutual information (MI) - previous attempts to connect sequence alignment and information theory have
not produced realistic estimates for the MI from a given alignment.

Results: Here we describe a simple and flexible approach to get robust estimates of MI from global alignments. For
mammalian mitochondrial DNA, our approach gives pairwise MI estimates for commonly used global alignment
algorithms that are strikingly close to estimates obtained by an entirely unrelated approach - concatenating and
zipping the sequences.

Conclusions: This remarkable consistency may help establish MI as a reliable tool for evaluating the quality of
global alignments, judging the relative merits of different alignment algorithms, and estimating the significance
of specific alignments. We expect that our approach can be extended to establish further connections between
information theory and sequence alignment, including applications to local and multiple alignment procedures.



Background 

Sequence alignment achieves many purposes and comes in several different varieties [1]: Local versus global 
(and even "glocal": [2]), pairwise versus multiple, and DNA/RNA versus proteins. Rather than listing all 
applications, we cite just two numbers: The two original papers on the BLAST algorithm for local alignment 
by [3] and on one of its improvements [4] have been cited more than 43,000 times, and the number of daily 
file uploads to the NCBI server providing BLAST is ~140,000 [5]. A partial list of alignment tools in the
public domain can be found at http://pbil.univ-lyon1.fr/alignment.html.

In global alignment, which we focus on here, two sequences of comparable length are placed one below
the other. The algorithm inserts blanks in each of the sequences such that the number of positions at which 
the two sequences agree is maximized. More precisely, a scoring scheme is used where each position at which 
the two sequences agree is rewarded by a positive score, while each disagreement ("mutation") and each 
insertion of a blank ("gap") is punished by a negative one. The best alignment is that with the highest total
score. In local alignment, one aligns only subsequences against each other and looks for the highest scores 
between any pairs of subsequences. Regions that cannot be well-aligned are simply ignored. Existing codes 
use either heuristic scoring schemes or scores derived from explicit probabilistic models [6]. 

In either case, the absolute value of the score cannot be used to judge reliably the quality or the
significance of an alignment. As a result, significance is typically estimated by aligning random sequences
("surrogates") and comparing the distribution of scores between these surrogates to the scores between the
true biological sequences. Significance estimates are particularly relevant when aligning a sequence of interest 
against an entire data bank, in order to find a homologue. In that case wrong estimates for the tail of the 
distribution of pairwise "similarities" could render the results worthless. 

In this context - and in many others - an objective measure for similarity between two biological sequences 
would be extremely useful. Such an objective measure is provided by information theory [7]. Roughly, the 
complexity K(A) of a sequence A is the minimal amount of information (measured in bits) needed to specify 
A uniquely. For two sequences A and B, the conditional complexity (or conditional information) K(A|B) is
the information needed to specify A, if B is already known. If A and B are similar, this information might
consist of a short list of changes needed to go from B to A, and K(A|B) is small. If, on the other hand,
A and B have nothing in common, then knowing B is useless and K(A|B) = K(A). Finally, the mutual
information (MI) is defined as the difference I(A; B) = K(A) - K(A|B). It is the amount of information
which is common to A and B, and is also equal to the amount of information in B which is useful for
describing A, and vice versa. Indeed, it can be shown that, up to correction terms that become negligible for
long sequences (see [7]): (a) I(A; B) ≥ 0; (b) I(A; B) = 0 if and only if A and B are completely independent;
(c) I(A; A) = K(A); and (d) I(A; B) = I(B; A). Moreover, the likelihood that A and B arose independently
is $p = 2^{-I(A;B)}$ (see [8]). Hence, the similarity is significant and not by chance when I(A; B) is large.
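
To get a feel for the magnitudes implied by $p = 2^{-I(A;B)}$, the following minimal Python sketch converts an MI value given in bits into $\log_{10} p$. The function name is ours and purely illustrative; the logarithm is returned because p itself underflows for realistic MI values.

```python
import math

def log10_independence_probability(mi_bits: float) -> float:
    """Return log10(p) for p = 2**(-I), with the MI I(A;B) given in bits."""
    return -mi_bits * math.log10(2.0)

# An MI of 8000 bits corresponds to p = 2**(-8000), which is far below the
# smallest positive floating-point number, so only its logarithm is printed.
print(log10_independence_probability(8000))
```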

The fact that alignment and information theory are closely related has been realized repeatedly. However, 
most work in this direction has focussed on aligning images rather than sequences [9]. Conceptually, these 
two problems are closely related, but technically, they are not. The effects of sequence randomness on
the significance of alignments have also been studied in [10]. Finally, attempts to extend the notion of edit
distance [1] to more general editing operations have been made. In this case the similarity of two sequences
is quantified by the complexity of the edit string, see [11]. Indeed, the aims of [11] are similar to ours, but
their approach differs in several key respects and leads to markedly different results.

Methods 

Translation String 

At the heart of our approach is the concept of a translation string. The translation string $T_{B|A}$ contains
the information necessary to recover the sequence B from another sequence A. Similarly, $T_{A|B}$ contains the
information needed to obtain A from B. Here we focus on DNA sequences, consisting of the letters A, C, G
and T, and corresponding to complete mitochondrial genomes. But the approach is more general and can
be applied to protein sequences without further effort.

A'        GGGAGATATAGCCGTTATGTGAATAGCACTGAATCAAGCG---------GCGTACG
B'        GGGACATATAG----TATGTAAATATCACTGAATCGAACGAACACGCACGTGTACG
T_{A|B}   00003000000CCGT0000010000200000000010100---------0100000
T_{B|A}   00003000000----0000010000200000000010100AACACGCAC0100000

Figure 1: Example of an alignment and of the two translation strings $T_{A|B}$ and $T_{B|A}$. Colors indicate sites
with mutations (red), gaps (blue), and conservation (green).

We refer to the i-th element of sequence X as $X_i$, and denote the length of X as $n_X$. Any global alignment
algorithm, when applied to A and B, outputs a pair of sequences (A', B') of equal length $n \ge \max\{n_A, n_B\}$.
The sequences A' and B' are obtained from A and B by inserting hyphens ("gaps") such that the total score is
maximized. The strings $T_{B|A}$ and $T_{A|B}$ also have length n, and are composed from an alphabet of nine
characters. For each i, the letter $T_{B|A,i}$ is a function of $A'_i$ and $B'_i$ only. An example of this process
is shown in Figure 1. The rules to create $T_{B|A}$ are as follows:

• if $A'_i = B'_i$, then $T_{B|A,i} = 0$;

• if $A'_i$ is a hyphen (gap), then $T_{B|A,i}$ has to specify explicitly what is in $B'_i$, hence
$T_{B|A,i} = B'_i \in \{A, C, G, T\}$;

• if $B'_i$ is a hyphen (gap), then $T_{B|A,i}$ has to indicate that something is deleted from A', but there is no
need to specify what. Hence $T_{B|A,i} = B'_i = -$;

• if $A'_i \to B'_i$ is a transition, i.e. a substitution A↔G or C↔T, then $T_{B|A,i} = 1$;

• if $A'_i \to B'_i$ is a transversion A↔C or T↔G, then $T_{B|A,i} = 2$;

• if $A'_i \to B'_i$ is a transversion A↔T or G↔C, then $T_{B|A,i} = 3$.

$T_{B|A}$ is defined such that B' (and thus also B) is obtained uniquely from A'. But A' can be obtained from
A using $T_{B|A}$. Thus $T_{B|A}$ does exactly what it was intended to do: it allows one to recover B from A. It
does not, however, allow one to recover A from B. Due to the second and third bullet points above, $T_{B|A}$ is
not the same as $T_{A|B}$. This distinguishes our approach from typical edit string methods.
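
The construction rules can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the results: the function names are ours, only the letters A, C, G, T and the hyphen are handled (wild card letters are not), and the example alignment is the one of Figure 1.

```python
GAP = "-"

# Substitution classes from the rules above, stored as unordered pairs.
CODE = {
    frozenset("AG"): "1", frozenset("CT"): "1",  # transitions
    frozenset("AC"): "2", frozenset("GT"): "2",  # transversions A<->C, T<->G
    frozenset("AT"): "3", frozenset("CG"): "3",  # transversions A<->T, G<->C
}

def translation_string(src_aln: str, dst_aln: str) -> str:
    """Build T_{dst|src} from two aligned (gapped) sequences of equal length."""
    if len(src_aln) != len(dst_aln):
        raise ValueError("aligned sequences must have equal length")
    out = []
    for s, d in zip(src_aln, dst_aln):
        if s == GAP and d == GAP:
            raise ValueError("a column must not contain two gaps")
        if s == d:
            out.append("0")          # conserved site
        elif s == GAP:
            out.append(d)            # insertion: spell out the target letter
        elif d == GAP:
            out.append(GAP)          # deletion: no need to say what is deleted
        else:
            out.append(CODE[frozenset((s, d))])
    return "".join(out)

def recover(src: str, t: str) -> str:
    """Recover the ungapped target from the ungapped source and T_{dst|src}."""
    out, i = [], 0                   # i is the current position in src
    for sym in t:
        if sym in "ACGT":            # inserted letter, src is not consumed
            out.append(sym)
        elif sym == GAP:             # deleted letter, skip it in src
            i += 1
        elif sym == "0":             # copy the letter of src
            out.append(src[i])
            i += 1
        else:                        # substitution: unique partner of src[i]
            out.append(next(d for d in "ACGT"
                            if CODE.get(frozenset((src[i], d))) == sym))
            i += 1
    return "".join(out)

# The alignment of Figure 1.
a_aln = "GGGAGATATAGCCGTTATGTGAATAGCACTGAATCAAGCG" + "-" * 9 + "GCGTACG"
b_aln = "GGGACATATAG" + "-" * 4 + "TATGTAAATATCACTGAATCGAACGAACACGCACGTGTACG"
t_b_given_a = translation_string(a_aln, b_aln)
assert recover(a_aln.replace(GAP, ""), t_b_given_a) == b_aln.replace(GAP, "")
```

Swapping the two arguments of `translation_string` yields $T_{A|B}$, which makes the asymmetry between the two strings explicit.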

Mutual Information 

An estimate of the conditional complexity K(B|A) is obtained by compressing $T_{B|A}$ using any general
purpose compression algorithm such as zip, gzip, bzip2, etc. In the results shown here we use lpaq1 [12] (see
also this reference for a survey of public domain lossless compression algorithms). Denoting by comp(A) the
compressed version of A and by len[A] the length of A in bits gives an exact upper bound

$$K(B|A) \le \mathrm{len}[\mathrm{comp}(T_{B|A})]. \qquad (1)$$

In order to obtain an estimate of MI, we have to subtract K(B|A) from K(B), which is also estimated via
compression. However, unlike $T_{B|A}$, B is a DNA string. Since general purpose compression algorithms are
known to be inferior for DNA [13, 14] we use an efficient DNA compressor called "XM" [14]. The resulting
MI estimate is

$$I(A;B) \approx I(A;B)_{\mathrm{align}} = \mathrm{len}[\mathrm{XM}(B)] - \mathrm{len}[\mathrm{lpaq1}(T_{B|A})]. \qquad (2)$$

At first sight it might seem paradoxical that $I(A;B)_{\mathrm{align}}$ can even be positive. Not only does $T_{B|A}$ involve
a larger alphabet than B, but, in general, it is also a longer string. Thus one could expect that $T_{B|A}$ would
not typically compress to a shorter size than B. The reason why this first impression is wrong is clear from
Figure 1: if A and B are similar, then $T_{B|A}$ consists mostly of zeroes and compresses readily. In practical
alignment schemes, the scores for mismatches are carefully chosen such that more frequent substitutions are
punished less than unlikely substitutions. In contrast, coding each mismatch simply by a letter in $T_{B|A}$ seems
to ignore this issue. However, more frequent mismatches will give letters occurring with higher frequency,
and general purpose compression algorithms utilize frequency differences to achieve higher compression.
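
The estimate of Eq. (2) can be sketched as follows. Note the loud assumption: neither XM nor lpaq1 is available as a Python library, so this sketch substitutes the standard-library `bz2` compressor for both, which will generally give weaker (smaller) MI estimates than the compressors used for our results. The toy sequence B and the toy translation string are synthetic and serve only to show why the estimate is positive for similar sequences.

```python
import bz2
import random

def compressed_bits(text: str) -> int:
    """Size in bits of the bz2-compressed ASCII text (an upper bound on its complexity)."""
    return 8 * len(bz2.compress(text.encode("ascii"), compresslevel=9))

def mi_align(b: str, t_b_given_a: str) -> int:
    """Eq. (2), with bz2 standing in for both XM and lpaq1 (assumption of this sketch)."""
    return compressed_bits(b) - compressed_bits(t_b_given_a)

rng = random.Random(0)
n = 20000
# Synthetic target sequence B: uniformly random DNA.
b = "".join(rng.choice("ACGT") for _ in range(n))
# Synthetic translation string: about 1% transitions, no gaps, otherwise conserved.
t = "".join("1" if rng.random() < 0.01 else "0" for _ in range(n))

# Positive: T is mostly zeroes and compresses far better than B itself.
print(mi_align(b, t))
```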

Conceptually our approach is similar to encoding of generalized edit strings in [11]. However, there are 
several pivotal differences between that work and ours. First, the authors in [11] did not compress their edit 
strings and as a result the conclusions they were able to draw from a quantitative analysis were much weaker 
than ours. Second, our approach utilizes an alignment algorithm to achieve an efficient encoding of $T_{B|A}$.
In addition to producing a better estimate of K(B|A), this allows us to make quantitative evaluations of
the algorithm itself. An additional difference between our approach and the traditional edit methods used
in approximate string matching [15] is that our translation strings do not give both translations A → B and
B → A from the same string. This asymmetry is crucial to establish the relations to conditional and mutual
information. 



For long strings, I(A; B) should be symmetric in its arguments. In general, the estimates satisfy
$I(A;B)_{\mathrm{align}} \approx I(B;A)_{\mathrm{align}}$ (see Figure S3 in the supplementary material). Indeed, the translation strings
$T_{B|A}$ and $T_{A|B}$ can differ substantially, resulting in different estimates for K(B|A) and K(A|B) via Eq. (1).
This difference is mostly cancelled by differences between len[XM(B)] and len[XM(A)]. Take, for instance,
the case where B is much shorter than A. Then $T_{B|A}$ consists mostly of hyphens and is highly compressible.
On the other hand, $T_{A|B}$ is similar to A, since most letters have to be inserted when translating B to A.
Thus both $I(A;B)_{\mathrm{align}}$ and $I(B;A)_{\mathrm{align}}$ are small compared to K(A), but for different reasons. Further
details are given in the supplementary material.

Tools 

We utilized the MAVID [16] and Kalign [17] global sequence alignment programs available for download 
at [18] and [19]. We also experimented with STRETCHER [20], lagan [21] and CLUSTALW 2 [22], and 
observed similar results. We have made no efforts here to optimize the scoring parameters of the algorithms 
used and have only used the defaults. 

For DNA string compression we utilized the expert model (XM) DNA compression algorithm [14]. For
compression of the translation strings we used lpaq1 [12]. The choice of lpaq1 was not crucial: the standard
LINUX tools gzip and bzip2 produced similar results. For DNA we also explored GenCompress [23] and
bzip2. Both showed markedly inferior results to XM (see supplementary information).

The complete mtDNA sequences used in our analysis were downloaded from [24]. They included 220 
mammals, 25 non-mammalian vertebrates, and 20 invertebrates. 

Results 

In Figure 2 we compare two MI estimates for pairs of species from various groups of animals. The first
estimate is obtained using the MAVID alignment tool [16] followed by compression, while the second is 
obtained by compression alone [23,25,26], without using any alignment algorithm. The latter estimate is 
made by comparing the size of the compressed concatenation AB to the sum of the sizes of the compressed 
individual files, 

$$I(A;B)_{\mathrm{compr}} = \mathrm{len}[\mathrm{XM}(A)] + \mathrm{len}[\mathrm{XM}(B)] - \mathrm{len}[\mathrm{XM}(AB)]. \qquad (3)$$
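
A minimal sketch of the alignment-free estimate of Eq. (3) is given below. As an explicit assumption, the standard-library `bz2` compressor replaces XM, which is not available from Python; since bzip2 was found to be markedly inferior to XM for DNA, the absolute values are not comparable to our results. The sequences are synthetic.

```python
import bz2
import random

def compressed_bits(text: str) -> int:
    """Size in bits of the bz2-compressed ASCII text."""
    return 8 * len(bz2.compress(text.encode("ascii"), compresslevel=9))

def mi_compr(a: str, b: str) -> int:
    """Eq. (3), with bz2 standing in for XM (assumption of this sketch)."""
    return compressed_bits(a) + compressed_bits(b) - compressed_bits(a + b)

rng = random.Random(0)
n = 10000
a = "".join(rng.choice("ACGT") for _ in range(n))           # synthetic sequence
unrelated = "".join(rng.choice("ACGT") for _ in range(n))   # independent of a

# The estimate for a sequence paired with itself exceeds that for an unrelated pair.
print(mi_compr(a, a), mi_compr(a, unrelated))
```

Even for identical sequences this stand-in recovers only part of K(A), which illustrates how strongly the estimate depends on the compressor.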

Figure 2: Scatter plot of MI estimates for complete mitochondrial DNA between pairs of species: $I_{\mathrm{compr}}$
using XM [14] vs. $I_{\mathrm{align}}$ using MAVID [16] followed by compression. Note that the two estimates generally
agree and fall on the diagonal, while in some cases one method does better than the other as explained in
the text. Here and in Figure 3, "vertebrata" means non-mammalian vertebrata. See Tools for a breakdown
of the number of mammals, vertebrata and invertebrata.

Although it is not possible to prove that $I_{\mathrm{compr}}$ or $I_{\mathrm{align}}$ are lower bounds for the true MI, generally it is
expected that both $I_{\mathrm{compr}}$ and $I_{\mathrm{align}}$ underestimate the true MI.

In Figure 2 we find that both MI estimates are approximately equal, despite the fact that alignment
algorithms and compression algorithms follow drastically different routes. Points above the diagonal indicate
that concatenation and compression - using the XM algorithm - produced a better estimate of MI, while
points below indicate that MAVID alignment followed by compression of its translation string produced a
better estimate. Different results are found by compressing with compression algorithms other than XM
(see supplementary material). In that case a vast majority of the points fall far below the diagonal. The
invertebrate-invertebrate pairs far above the diagonal in Figure 2 correspond to pairs of species where the
individual genes are similar, but their ordering is changed. In that case a compression algorithm is superior
to a global alignment algorithm, since it is not affected by shuffling the open reading frames (ORFs). Most
negative estimates for MI seen in Figure 2 represent cases where shuffling the ORFs prevented reasonable
global alignments. Results could have been improved in such cases by masking part of the genome, but we
have not tried this.



MI estimates obtained using other global alignment algorithms are similar to those obtained with MAVID;
an example is shown in Figure 3. Since neither scoring scheme was optimized to obtain this data, we do
not consider this figure to indicate which of the two alignment algorithms is better. Rather, it represents
a proof of principle that our method can be used to identify strengths and weaknesses of different alignment
algorithms and evaluate objectively the similarity of any sequence alignment.



Figure 3: Scatter plot comparing alignment-based MI estimates for the same pairs of species as in Figure 2:
Kalign [17] vs. MAVID [16]. The horizontal axis shows the MI estimated using the MAVID alignment
algorithm (kbyte). Points on the diagonal indicate agreement between the two estimates. These
data were generated using the default scoring parameters. Therefore, the plot represents a proof of principle
rather than a definitive statement about the quality of the two alignment algorithms shown.

Discussion 

Several generalizations and improvements are feasible and are listed below: 

(1) Use more efficient encodings of the translation string. For instance, we only used the letters $A'_i$ and
$T_{B|A,i}$ to get $B'_i$, but one could also use e.g. $A'_{i-1}$, $B'_{i-1}$, and/or $T_{B|A,i-1}$.

(2) Use local alignments instead of global ones. In a local alignment between sequences A and B, large 
parts of B are not aligned with A at all and are encoded without reference to A. Only the aligned parts give 
information from A that can be used to recover B. Before making the jump from global to local alignments, 
an intermediate step would be a "glocal" alignment tool such as shuffle-lagan ("slagan") of [2].

(3) Construct objective measures based on information theory for the quality of multiple alignments. A 
straightforward measure is the information about sequence C obtained from aligning it simultaneously with
A and B. Assume e.g. that the sequences A and B are much more similar to each other than either A and 
C or B and C, as for human, chimpanzee, and chicken. In order to measure the MI between chicken and the 
primates, one could first align A and B and then align, in a second step, C to the fixed alignment (A, B). 

Conclusions 

By showing that the mutual information between two sequences can be easily estimated from alignments, we
have established a direct link between sequence alignment and Kolmogorov information theory. Technically,
we have dealt only with pairwise global alignment, but at least the basic concepts should have much wider
applicability. We hope that our work will be important both for the conventional (alignment-based) approach
to sequence comparison and for the more recent approach, grounded in Kolmogorov theory, that relies on
compression and concatenation.

The accuracy of MI estimates based entirely on compression and concatenation depends crucially on the 
quality of the compression algorithm (see Figures S1, S2). Indeed, Figure 2 shows that alignment-based
estimates can be superior to those based on compression alone, but it also shows in other cases the latter
to be superior. It is an open question whether alignment-free algorithms for sequence comparison [27] will
become more widely used, will eventually displace alignment-based algorithms, or whether both approaches 
will merge into a unified approach. In any case, tools to compare the successes and failures of either approach 
will be crucial. 

Supplementary Material



Blasting and Zipping: Sequence Alignment and 
Mutual Information 

O. Penner, P. Grassberger, and M. Paczuski



1) Mutual information (MI) provides an absolute and objective measure of the similarity between any two
sequences A and B built from a finite alphabet. The main result of our paper is that MI can be reliably
estimated from alignments in general, and from global alignments in particular. This is done by first using
an alignment between A and B to construct two 'translation strings', $T_{B|A}$ and $T_{A|B}$, which allow B to be
uniquely reconstructed from A, and A from B, respectively. After compression, the lengths of these strings
give estimates of the conditional algorithmic informations K(B|A) and K(A|B). Finally, the (algorithmic)
MI is estimated by means of the general relations [?]

$$I(A;B) = K(A) - K(A|B), \qquad I(B;A) = K(B) - K(B|A), \qquad (1)$$

where K(A) and K(B) are the algorithmic informations (also called "Kolmogorov-Chaitin complexities") of
the sequences A and B. These are estimated by the lengths of compressed versions of A and B. A central
result of algorithmic information theory is that I(A; B) = I(B; A) up to terms O(log N), where N is the
length of the concatenation AB [?].

The mutual informations estimated this way, denoted $I_{\mathrm{align}}$ (see Eq. (3)), can be compared to estimates
of MI obtained without using any alignment. The latter are obtained by comparing the combined lengths of
the compressed versions of A and B to the length of a compressed version of the concatenation AB. We
denote this estimate $I_{\mathrm{compr}}$; it is given by Eq. (4). Our main result in the paper is that both estimates,
despite being independent and following different strategies, yield very similar results for mitochondrial
DNA (mtDNA) of vertebrates. More precisely, they give practically identical estimates for species within
the same family. $I_{\mathrm{align}}$ is typically slightly larger for species in the same class but in different families,
while $I_{\mathrm{compr}}$ is larger for species in different phyla, a case where global alignment algorithms break down.

2) The estimate $I_{\mathrm{compr}}$ is the standard estimate for the MI between two strings, and has been used recently
in a large number of biological and non-biological problems [?????]. Most of the work done with $I_{\mathrm{compr}}$
has focussed on the clustering of sequences and building phylogenetic trees; however, it seems that this
activity has met with considerable skepticism in the biological community. One reason is that it appears
that no biological knowledge is incorporated by estimating $I_{\mathrm{compr}}$. This is in stark contrast to the
substantial amount of detailed knowledge that has gone into the construction of phylogenies based on
alignment methods. Another reason is that even the order of magnitude of $I_{\mathrm{compr}}$ depends on the quality
of the compression algorithms used. While two compression algorithms can be easily compared via the
lengths of the compressed files they produce, it is practically impossible to judge the merits of a
compression algorithm on an absolute scale. Most non-trivial sequences like DNA, proteins, music, or
written language show very long and structurally complex long-range correlations. To obtain good
estimates of $I_{\mathrm{compr}}$ it is crucial that these long-range effects are correctly taken into account; however,
the precise structure and the related effect on the compressibility are usually not known.

Fig. S2: Similar to Fig. S1, but $I_{\mathrm{compr}}$ using XM versus $I_{\mathrm{compr}}$ using GenCompress.

It is generally accepted that up to approximately five years ago the most advanced general purpose
compression algorithms were not very efficient for DNA [? ?]. However, there has been substantial progress
during the last five years in this field [?]. Nevertheless, we found that none of the recent general purpose
algorithms we checked (lpaq1, durilca, bbb, dark [?]) gave better compression rates for DNA than
GenCompress [?], one of the best public domain DNA compression algorithms known prior to 2007. A
typical plot ($I_{\mathrm{compr}}$ using lpaq1 versus $I_{\mathrm{compr}}$ using GenCompress for mtDNA) is shown in Fig. S1.

On the other hand, with the development of XM [?] there has been important progress in biosequence
(DNA, proteins) specific compression algorithms. According to the authors of Ref. [?], the improvement
brought by XM over previous DNA compression algorithms (such as GenCompress) is only a few per cent,
when single sequences are considered. However, we found that the improvement in estimates of $I_{\mathrm{compr}}$ is
much larger, presumably because long-range correlations play a much bigger role there. Results for
$I_{\mathrm{compr}}$ using GenCompress versus $I_{\mathrm{compr}}$ using XM are shown in Fig. S2. We observe vast differences,
except for very closely related species, where both compression algorithms are able to detect the strong
similarity. For species in different families, XM typically gives three to five times larger MIs than
GenCompress. Note that we do not have a rigorous proof that larger values of $I_{\mathrm{compr}}$ are better. The
difficulty is that $I_{\mathrm{compr}}$ is the difference between terms which are all overestimated. But the negative
term, corresponding to K(AB), is most difficult to estimate because AB is much longer than A or B. Thus
K(AB) is most likely to be strongly overestimated. As such, larger values of $I_{\mathrm{compr}}$ most likely indicate
improved treatment of long-range interdependencies and better estimates of MI.

The strong dependence on compression algorithm observed in Fig. S2 seems to justify skepticism against
the use of $I_{\mathrm{compr}}$. This is contradicted, however, by the fact that the values of $I_{\mathrm{compr}}$ obtained with XM
are in very good general agreement with the values of $I_{\mathrm{align}}$, as shown in Fig. 2. The latter suggests that
it should be possible to improve $I_{\mathrm{compr}}$ further by a factor ≈ 2 for species in different phyla, but probably
not by more.

3) An important aspect of our treatment is that the translation strings $T_{B|A}$ and $T_{A|B}$ are different. Thus
the conditional informations K(B|A) and K(A|B) are also different. This is in stark contrast to the notion
of edit distances [?], where one typically wishes to define a symmetric distance measure directly via the
number and costs of edit operations. In our treatment, on the other hand, the symmetric distance is
obtained in further steps, by first going over to MIs and then, if one wishes, by deriving universal
compression distances from the MI [?]. Although this, at first, appears more complicated, our method has
the advantage of providing a direct link with information theory.

A crucial numerical requirement for our formalism is that the estimates $I_{\mathrm{align}}$ should be symmetric under
the exchange of the sequences to within terms of O(log N), $I(A;B)_{\mathrm{align}} \approx I(B;A)_{\mathrm{align}}$. Any strong
violation of this symmetry would indicate that either the construction of the translation string is not
optimal, or that the compression algorithm used in Eq. (3) is far from perfect. In contrast, it is not required
that K(B|A) is symmetric. To show that $I(A;B)_{\mathrm{align}}$ is more symmetric than K(B|A), we plot in Fig. S3 the
differences $|I(A;B)_{\mathrm{align}} - I(B;A)_{\mathrm{align}}|$ and $|K(B|A) - K(A|B)|$ against $I(A;B)_{\mathrm{align}}$. For Fig. S3 we used
only mammalian mtDNA because the estimates $I(A;B)_{\mathrm{align}}$ for species in different classes are too uncertain
for a meaningful analysis. We see that there are no problems at all for closely related species, as in such
cases both $I(A;B)_{\mathrm{align}}$ and K(B|A) are symmetric. For more distant species, both are still symmetric for
the majority of pairs, but there are also numerous outliers where K(B|A) is strongly asymmetric. In all
those cases the asymmetry of $I(A;B)_{\mathrm{align}}$ is significantly smaller than that of K(B|A).

Fig. S3: Scatter plot of asymmetries of $I_{\mathrm{align}}$ and of conditional informations, both versus $I_{\mathrm{align}}$
(horizontal axis: [I(A,B) + I(B,A)]/2, in KBytes). Pairs of species are drawn only from mammals.

A         GGGAGATATAGCCGTTATGTGAATAGCACTGAATCAAGCGGCGTACGCGTTATGTGA
B         (empty)
A'        GGGAGATATAGCCGTTATGTGAATAGCACTGAATCAAGCGGCGTACGCGTTATGTGA
B'        ---------------------------------------------------------
T_{A|B}   GGGAGATATAGCCGTTATGTGAATAGCACTGAATCAAGCGGCGTACGCGTTATGTGA
T_{B|A}   ---------------------------------------------------------

Fig. S4: Alignment and translation strings for comparing a random sequence with an empty sequence.
Here, sequence A is random and of length n, while B is empty. As explained in the text, the estimated
MIs $I(B;A)_{\mathrm{align}}$ and $I(A;B)_{\mathrm{align}}$ agree with the expected results to within terms of order O(log n).

4) In the paper we have argued qualitatively how the symmetry of $I(A;B)_{\mathrm{align}}$ is compatible, in view of
Eq. (3), with even very asymmetric values of K(B|A). Here we shall discuss some extreme cases
quantitatively and exactly.

a) Assume that the string A is a random string over the alphabet {A, C, G, T} of length n, and the string B
is the empty string. Then the optimal alignment is as shown in Fig. S4. In order to specify $T_{B|A}$ one must
encode the letter "-" and the number n of repetitions, giving $K(B|A) \approx \log_2 n$ bits, while K(B) = 0. On the
other hand, K(A|B) = K(A). These give $I(B;A)_{\mathrm{align}} \approx -\log_2 n$ bits and $I(A;B)_{\mathrm{align}} = O(1)$. The reason why
$I(A;B)_{\mathrm{align}}$ is not exactly zero is that the two terms on the r.h.s. of Eq. (3) are compressed by means of
different algorithms. Both A and $T_{A|B}$ have to be given verbatim, so the only difference between the two
terms in Eq. (3) is the difference between the overheads in lpaq1 and XM.

In summary, $I(B;A)_{\mathrm{align}} - I(A;B)_{\mathrm{align}} = O(\log n)$, as expected on general grounds and for the difference
between the exact MI values.

A         GGGCGCTCTCGAAGTTCTGTGCCTAGTT
B         CCCCCCCCCCCCCCCCCCCCCCCCCCCC
A'        GGGCGCTCTCGAAGTTCTGTGCCTAGTT----------------------------
B'        ----------------------------CCCCCCCCCCCCCCCCCCCCCCCCCCCC
T_{A|B}   GGGCGCTCTCGAAGTTCTGTGCCTAGTT----------------------------
T_{B|A}   ----------------------------CCCCCCCCCCCCCCCCCCCCCCCCCCCC

Fig. S5: Alignment and translation strings for comparing a random sequence with a sequence composed of
only one letter. Here, sequence A is random of length n, while B is a string of length n consisting only of
"C". As explained in the text, the estimated MIs $I(B;A)_{\mathrm{align}}$ and $I(A;B)_{\mathrm{align}}$ agree with the expected
results to within terms of order O(log n).

A         CCCTCCCCGCCCACCCCTCCCCCGCCAA
B         CCCCCCCCCCCCCCCCCCCCCCCCCCCC

Alignment 1
A'        CCCTCCCCGCCCACCCCTCCCCCGCCAA
B'        CCCCCCCCCCCCCCCCCCCCCCCCCCCC
T_{A|B}   0001000030002000010000030022
T_{B|A}   0001000030002000010000030022

Alignment 2
A'        CCCTCCCCGCCCACCCCTCCCCCGCCAA----------------------------
B'        ----------------------------CCCCCCCCCCCCCCCCCCCCCCCCCCCC
T_{A|B}   CCCTCCCCGCCCACCCCTCCCCCGCCAA----------------------------
T_{B|A}   ----------------------------CCCCCCCCCCCCCCCCCCCCCCCCCCCC

Fig. S6: Two alternative alignments between a biased random sequence and a sequence composed of only
one letter. See the text for details.

b) Assume that A is a random string over the alphabet {A, C, G, T} of length n, and B a string of the same
length consisting of a single letter, say "C". The optimal alignment for this case is shown in Fig. S5. Now
$T_{B|A}$ consists of n hyphens followed by n "C"s, which gives $I(B;A)_{\mathrm{align}} \approx -\log_2 n$ bits. Similarly,
$T_{A|B}$ = A, followed by n hyphens, so that also $I(A;B)_{\mathrm{align}} \approx -\log_2 n$. Thus, for Fig. S5 the difference
$I(B;A)_{\mathrm{align}} - I(A;B)_{\mathrm{align}}$ is again as expected for the exact MI values.
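
Case a) can be checked numerically with a short Python sketch. As an assumption of the sketch, a single standard-library compressor (`bz2`) is used for both terms instead of XM and lpaq1, so the O(1) overhead difference mentioned above vanishes and the estimate $I(A;B)_{\mathrm{align}}$ comes out exactly zero. The sequence A is synthetic.

```python
import bz2
import random

def compressed_bits(text: str) -> int:
    """Size in bits of the bz2-compressed ASCII text."""
    return 8 * len(bz2.compress(text.encode("ascii")))

rng = random.Random(1)
n = 20000
a = "".join(rng.choice("ACGT") for _ in range(n))  # random sequence A
b = ""                                             # empty sequence B

t_b_given_a = "-" * n  # every letter of A is deleted
t_a_given_b = a        # every letter of A is inserted verbatim

# Analogue of Eq. (2) in both directions, using the same compressor throughout.
i_ba = compressed_bits(b) - compressed_bits(t_b_given_a)
i_ab = compressed_bits(a) - compressed_bits(t_a_given_b)

# i_ab is exactly zero here; i_ba is slightly negative because a long run of
# hyphens still costs a few more bytes than the empty string.
print(i_ba, i_ab)
```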

c) Finally, we can consider a situation similar to case b), but with A not fully random. Instead, we assume
that A is i.i.d. with $\mathrm{prob}(A_i = C) > \mathrm{prob}(A_i = A) = \mathrm{prob}(A_i = G) = \mathrm{prob}(A_i = T)$. In this case one
might be inclined intuitively to prefer alignment #1 of Fig. S6 over alignment #2. But for alignment #1
both $T_{B|A}$ and $T_{A|B}$ would be equal to A, up to a complexity preserving transformation A→2, C→0, G→3,
and T→1. Thus $I(B;A)_{\mathrm{align}} \propto n$, while $I(A;B)_{\mathrm{align}} = O(1)$. On the other hand, for alignment #2 both
$I(B;A)_{\mathrm{align}}$ and $I(A;B)_{\mathrm{align}}$ are the same as in Fig. S5. Thus alignment #2 is the optimal alignment,
although one might have preferred alignment #1 intuitively.

A'          GGGAG ATATAGCCGTTATGTGAATAGCACTGAATC AAGCG GCGT
B'          GGGAC ATATAG TATGTAAATATCACTGAATCGAACGAACACGCACGTGT
C'          GGGACGGGATATAGCCGTTATGTAA ACTGTATCAAGCGAACACGCACGTGT
T_{C|A,B}   00003GGG000000C0000000010 0000300010000A000000000000

Fig. S7: Alignment of a third sequence C to an already existing alignment between A and B. The translation
string $T_{C|A,B}$ for reconstructing C from A and B is obtained by using locally one of the strings A and B as
"master strings" and applying the rules described in the main part of the paper. The actual master string is
printed in red.

5) Up to now, we have assumed that the two DNA files contain only the four letters "A", "C", "G", and
"T". In reality, the data banks allow also for "wild card" letters indicating ambiguities: "N" for any
nucleotide, "R" for a purine, "Y" for a pyrimidine, etc. Whenever either of the two sequences contains such
a wild card character, we put $T_{B|A,i} = B_i$, i.e. the letter in the target string is copied verbatim. This will
in some cases slightly overestimate the conditional information K(B|A). But such overestimations are
expected to occur anyhow, and in the database we used, wild cards are sufficiently rare to have very little
effect.

6) In the case of proteins (i.e. amino acid sequences)
we have an alphabet of 20 letters. This makes the analysis
a bit more lengthy, although it is basically the same as
for DNA. The translation strings now contain 41 characters:
twenty letters for specifying insertions, one hyphen
for indicating a deletion, and 20 numbers for indicating
substitutions. For the latter we have some freedom. As
for DNA, we could use this freedom to encode forward
and backward substitutions by the same number. Alternatively,
we could encode the 20 amino acids by the numbers
0 ... 19, and encode a substitution $j \to k$ ($j, k = 0 \ldots 19$)
by the number $(k - j) \bmod 20$. This has the advantage
of simplicity, but led in the case of DNA to marginally
worse results.
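
The second option in point 6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ordering of the amino acids in `AMINO` (alphabetical one-letter codes) and the function names are hypothetical, chosen here for the example and not taken from the paper.

```python
# Assumed ordering of the 20 amino acids (alphabetical one-letter codes).
# Any fixed ordering would do; this one is chosen only for illustration.
AMINO = "ACDEFGHIKLMNPQRSTVWY"


def substitution_code(src: str, dst: str) -> int:
    """Code for the substitution src -> dst, computed as (k - j) mod 20."""
    j = AMINO.index(src)
    k = AMINO.index(dst)
    return (k - j) % 20


def apply_substitution(src: str, code: int) -> str:
    """Invert the code: recover dst from src and the number (k - j) mod 20."""
    j = AMINO.index(src)
    return AMINO[(j + code) % 20]


# Code 0 means "no change"; forward and backward substitutions differ.
print(substitution_code("A", "A"))  # 0
print(substitution_code("A", "C"))  # 1
print(substitution_code("C", "A"))  # 19
```

In this variant the forward substitution A → C and the backward substitution C → A receive different numbers (1 and 19), in contrast to the first option, where both directions share one number.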

7) For local alignments, the output of an alignment algorithm
consists of a list of matching regions, together
with the actual alignments of those regions. These
matching regions can in principle overlap. Thus in order
to construct $T_{B|A}$ one first has to select a subset of
matching regions $\mathcal{M}_k$ that do not overlap on sequence B.
Each one of these matches is characterized by its starting
points in sequences A and B and by the translation
string restricted to the matching region,

$$\mathcal{M}_k = \{\, n_k^{(A)},\; n_k^{(B)},\; T_{B|A;k} \,\}. \qquad (2)$$

The entire translation string then consists of all these
pieces of information, separated by uniquely decodable
separators, plus the verbatim description of sequence B
in regions where no matches were found or used. Indeed,
if one has the latter verbatim description and all
translation strings $T_{B|A;k}$ for the matching regions, one
can recover the points $n_k^{(B)}$ and does not have to include
them in $T_{B|A}$ explicitly.
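
The text does not say how the non-overlapping subset of matching regions is chosen. As one possible illustration, the following Python sketch uses a greedy "earliest end first" rule on sequence B. The rule, the `Match` record, and all field and function names are assumptions made for this example and are not taken from the paper.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """One matching region: start in A, start in B, and its translation string."""
    start_a: int      # starting point in sequence A
    start_b: int      # starting point in sequence B
    length_b: int     # number of characters of B covered by the region
    translation: str  # translation string restricted to this region


def select_non_overlapping(matches: List[Match]) -> List[Match]:
    """Greedily keep regions that do not overlap on sequence B.

    Regions are visited in order of their end position on B, and a region
    is kept only if it starts at or after the end of the last kept region.
    """
    chosen: List[Match] = []
    last_end = 0
    for m in sorted(matches, key=lambda m: m.start_b + m.length_b):
        if m.start_b >= last_end:
            chosen.append(m)
            last_end = m.start_b + m.length_b
    return chosen
```

Other selection rules, for example preferring the regions with the shortest translation strings, are equally conceivable; the sketch only shows that the selection step is cheap once the regions are known.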

8) As a first step beyond pairwise alignments and pairwise
MIs, we shall discuss the estimation of the MI between
a single sequence C and a pair (A, B). For example,
an application of this could be to estimate the similarity
between mouse and the family of hominids, where
the latter is characterised by the two species homo and
chimpanzee.

Estimating $I(C;(A,B))$ via concatenation and compression
alone is easy. One just has to modify Eq. (4)
to

$$I(C;(A,B))_{\rm compr} = {\rm len}[XM(AB)] + {\rm len}[XM(C)] - {\rm len}[XM(ABC)] \qquad (3)$$

where AB and ABC denote the concatenations of A with
B and of AB with C. This gives $I(C;(A,B))_{\rm compr} > I(C;A)_{\rm compr}$
and $I(C;(A,B))_{\rm compr} > I(C;B)_{\rm compr}$, in
agreement with general relations between MIs and with
intuition. In contrast, concatenating A and B and then
aligning (globally!) C with AB would lead to very poor
estimates of $I(C;(A,B))_{\rm align}$.
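
The structure of Eq. (3) can be illustrated with a short Python sketch. It is only a sketch under a strong assumption: the general-purpose compressor `zlib` from the Python standard library is used as a stand-in for the XM compressor, so the numbers it produces are much cruder than those discussed in the text. The function names are hypothetical.

```python
import random
import zlib


def compressed_bits(seq: str) -> int:
    """Length in bits of the zlib-compressed sequence (stand-in for len[XM(.)])."""
    return 8 * len(zlib.compress(seq.encode("ascii"), 9))


def mi_single_vs_pair(c: str, a: str, b: str) -> int:
    """Estimate I(C;(A,B)) as len[Z(AB)] + len[Z(C)] - len[Z(ABC)], in bits."""
    return compressed_bits(a + b) + compressed_bits(c) - compressed_bits(a + b + c)


# Toy data: random DNA strings; the "related" C is simply a copy of A.
rng = random.Random(0)
a = "".join(rng.choices("ACGT", k=2000))
b = "".join(rng.choices("ACGT", k=2000))
c_unrelated = "".join(rng.choices("ACGT", k=2000))

mi_related = mi_single_vs_pair(a, a, b)
mi_unrelated = mi_single_vs_pair(c_unrelated, a, b)
print(mi_related > mi_unrelated)
```

With these toy inputs the estimate for the copied sequence should come out far larger than for the unrelated one, because the compressor can encode the copy by pointing back into AB.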

Instead, one has to use progressive multiple alignments,
where one first aligns the two sequences A and B,
and then aligns the third sequence C to the fixed alignment
of the first two. The central problem is to construct
a translation string that allows one to reconstruct C from
the aligned pair (A, B).

One possibility is the following, illustrated in Fig. S7.
We start at position $i = 1$ and construct $T_{C|AB}$ from left
to right, using at each step one of A or B as the "master
sequence". If A is presently the master sequence,
then $T_{C|AB,i} = T_{C|A,i}$, where $T_{C|A,i}$ is as described in
the main paper. Similarly, if B is the master sequence,
then $T_{C|AB,i} = T_{C|B,i}$. A sequence (A or B) is kept as
the master sequence until $C_i$ disagrees with the character
in the master sequence and is identical to the character
in the other (non-master) sequence. At this point master
and non-master sequences switch their status. In a
slightly more sophisticated version, one keeps track of
the number of 'mistakes' made recently by sequences A
and B, and switches only when the current master has
done significantly worse over the recent past. We have
no proof that either encoding is optimal, but both guarantee
at least that the "better" of the two sequences A
and B is used as a template to reconstruct C.
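
The simple switching rule can be sketched in Python as follows. The per-position symbols used here are a simplified assumption and not the rules of the main paper: "0" copies the master character, "-" marks a deletion, a letter spells out an insertion, and a digit 1 to 3 encodes a substitution as the index shift modulo 4 in the order ACGT. The three input rows are assumed to be already aligned, with equal length and "-" for gaps; all function names are hypothetical.

```python
ALPHABET = "ACGT"
GAP = "-"


def encode_symbol(src: str, dst: str) -> str:
    """Symbol that turns master character src into target character dst."""
    if src == dst:
        return "0"   # copy the master character (also gap against gap)
    if dst == GAP:
        return GAP   # deletion: the target has no character here
    if src == GAP:
        return dst   # insertion: spell out the target letter
    return str((ALPHABET.index(dst) - ALPHABET.index(src)) % 4)  # substitution


def decode_symbol(src: str, sym: str) -> str:
    """Inverse of encode_symbol for a known master character src."""
    if sym == "0":
        return src
    if sym == GAP or sym in ALPHABET:
        return sym
    return ALPHABET[(ALPHABET.index(src) + int(sym)) % 4]


def encode_with_master(a: str, b: str, c: str) -> str:
    """Build the translation string for C from the aligned rows A and B."""
    rows, master, out = (a, b), 0, []
    for i, ch in enumerate(c):
        m, other = rows[master][i], rows[1 - master][i]
        out.append(encode_symbol(m, ch))
        if ch != m and ch == other:   # master was wrong, the other row was right
            master = 1 - master
    return "".join(out)


def decode_with_master(a: str, b: str, t: str) -> str:
    """Reconstruct the aligned row C from A, B and the translation string."""
    rows, master, out = (a, b), 0, []
    for i, sym in enumerate(t):
        m, other = rows[master][i], rows[1 - master][i]
        ch = decode_symbol(m, sym)
        out.append(ch)
        if ch != m and ch == other:   # same switching rule as in the encoder
            master = 1 - master
    return "".join(out)
```

Because the decoder sees each reconstructed character before it moves on, it can apply the same switching rule as the encoder, so the identity of the current master never has to be stored in the translation string.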

We have made very preliminary numerical tests show- 
ing that the proposals made in points 6) to 8) are po- 
tentially feasible, but much more complete investigations 
are needed and will be given in future publications. 



